Setup:
Ensure you have completed the setup instructions below before the lecture. We will be using R for the following analyses, and you will need to download the data files coefHorvath.rds, methylation.rds, and prostate.rds if you want to run this code too.
# Install BiocManager
install.packages("BiocManager", quiet = TRUE)
# Download and read dependencies file
download.file(
"https://raw.githubusercontent.com/EleanorSC/High-Dimensional-Statistics/main/dependencies.csv",
destfile = 'dependencies.csv'
)
deps <- read.table('dependencies.csv')
# Install dependencies using BiocManager
BiocManager::install(deps[[1]])
# Create a directory for data files
dir.create("data", showWarnings = FALSE)
# List of data files to download
data_files <- c(
"coefHorvath.rds",
"methylation.rds",
"prostate.rds"
# "scRNAsea_data.rds" NOTE: if extension exercises are reached, load this dataset
)
# Download data files into the "data" directory
for (file in data_files) {
download.file(
url = file.path(
"https://raw.githubusercontent.com/EleanorSC/High-Dimensional-Statistics/main/Data",
file
),
destfile = file.path("data", file)
)
}
Descriptions of four research questions and their datasets are given below. Which of these scenarios use high-dimensional data?
Challenge 1 Solution:
After attempting to answer these yourself, click on “Show Answers”.
No. The number of features is relatively small (4, including the response variable, since it is an observed variable).
Yes, this is an example of high-dimensional data. There are 200,004 features.
No. The number of features is relatively small (6).
Yes. There are 20,008 features.
Load the dataset using the here package. Ensure the
prostate.rds file is located in the data
folder of your project directory.
Examine the dataset (in which each row represents a single patient) to:
Determine how many observations (n) and features (p) are
available (hint: see the dim() function).
Examine what variables were measured (hint: see the
names() and head() functions).
Plot the relationship between the variables (hint: see the
pairs() function). What problem(s) with high-dimensional
data analysis does this illustrate?
Challenge 2 Solution:
After attempting to answer these yourself, click on “Show Answers”.
# ----------------------------------------------------------------------
# Examine the Dataset
# ----------------------------------------------------------------------
# The dataset represents clinical data where each row corresponds to a single patient.
# Let's first examine the dimensions of the dataset to understand its structure.
# Load the dataset first (the prostate.rds file should be in the "data" folder)
library("here")
prostate <- readRDS(here("data/prostate.rds"))
# Determine the number of observations (n) and features (p)
dimensions <- dim(prostate) # Returns a vector: [number of rows, number of columns]
print(dimensions) # Print the dimensions
## [1] 97 9
# Examine the variables measured in the dataset
variable_names <- names(prostate) # Returns the column names (variables)
print(variable_names)
## [1] "lcavol" "lweight" "age" "lbph" "svi" "lcp" "gleason"
## [8] "pgg45" "lpsa"
# Preview the first few rows using head()
head(prostate)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
# ----------------------------------------------------------------------
# Visualize Relationships Between Variables
# ----------------------------------------------------------------------
# Plot pairwise relationships between variables using the `pairs()` function.
# This function creates a scatterplot matrix to visualize the relationships
# between all pairs of numeric variables.
pairs(prostate)
1). Use the cor() function to examine correlations
between all variables in the prostate dataset. Are some
pairs of variables highly correlated using a threshold of 0.75 for the
correlation coefficients?
2). Use the lm() function to fit univariate regression
models to predict patient age using two variables that are highly
correlated as predictors. Which of these variables are statistically
significant predictors of age? Hint: the summary() function
can help here.
3). Fit a multiple linear regression model predicting patient age using both variables. What happened?
Challenge 3 Solution:
After attempting to answer these yourself, click on “Show Answers”.
1). Using the cor() function to examine
correlations between all variables
# ----------------------------------------------------------------------
# 1. Examine Correlations Between Variables
# ----------------------------------------------------------------------
# Use the cor() function to compute pairwise correlations between all variables
# in the prostate dataset. Identify pairs of variables with a correlation
# coefficient greater than 0.75 (absolute value).
# Compute the correlation matrix for the dataset
cor(prostate)
## lcavol lweight age lbph svi lcp
## lcavol 1.0000000 0.194128286 0.2249999 0.027349703 0.53884500 0.675310484
## lweight 0.1941283 1.000000000 0.3075286 0.434934636 0.10877851 0.100237795
## age 0.2249999 0.307528614 1.0000000 0.350185896 0.11765804 0.127667752
## lbph 0.0273497 0.434934636 0.3501859 1.000000000 -0.08584324 -0.006999431
## svi 0.5388450 0.108778505 0.1176580 -0.085843238 1.00000000 0.673111185
## lcp 0.6753105 0.100237795 0.1276678 -0.006999431 0.67311118 1.000000000
## gleason 0.4324171 -0.001275658 0.2688916 0.077820447 0.32041222 0.514830063
## pgg45 0.4336522 0.050846821 0.2761124 0.078460018 0.45764762 0.631528245
## lpsa 0.7344603 0.354120390 0.1695928 0.179809410 0.56621822 0.548813169
## gleason pgg45 lpsa
## lcavol 0.432417056 0.43365225 0.7344603
## lweight -0.001275658 0.05084682 0.3541204
## age 0.268891599 0.27611245 0.1695928
## lbph 0.077820447 0.07846002 0.1798094
## svi 0.320412221 0.45764762 0.5662182
## lcp 0.514830063 0.63152825 0.5488132
## gleason 1.000000000 0.75190451 0.3689868
## pgg45 0.751904512 1.00000000 0.4223159
## lpsa 0.368986803 0.42231586 1.0000000
# Rounded to two decimals for readability:
round(cor(prostate), 2)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## lcavol 1.00 0.19 0.22 0.03 0.54 0.68 0.43 0.43 0.73
## lweight 0.19 1.00 0.31 0.43 0.11 0.10 0.00 0.05 0.35
## age 0.22 0.31 1.00 0.35 0.12 0.13 0.27 0.28 0.17
## lbph 0.03 0.43 0.35 1.00 -0.09 -0.01 0.08 0.08 0.18
## svi 0.54 0.11 0.12 -0.09 1.00 0.67 0.32 0.46 0.57
## lcp 0.68 0.10 0.13 -0.01 0.67 1.00 0.51 0.63 0.55
## gleason 0.43 0.00 0.27 0.08 0.32 0.51 1.00 0.75 0.37
## pgg45 0.43 0.05 0.28 0.08 0.46 0.63 0.75 1.00 0.42
## lpsa 0.73 0.35 0.17 0.18 0.57 0.55 0.37 0.42 1.00
# As seen above, some variables are highly correlated.
# In particular, the correlation between gleason and pgg45 is approximately 0.75
cor_matrix <- cor(prostate, use = "complete.obs")
high_correlations <- which(abs(cor_matrix) > 0.75 & abs(cor_matrix) < 1, arr.ind = TRUE)
# Extract and display the highly correlated variable pairs
high_cor_pairs <- tibble::tibble(
Var1 = rownames(cor_matrix)[high_correlations[, 1]],
Var2 = colnames(cor_matrix)[high_correlations[, 2]],
Correlation = cor_matrix[high_correlations]
)
print(high_cor_pairs)
## # A tibble: 2 × 3
## Var1 Var2 Correlation
## <chr> <chr> <dbl>
## 1 pgg45 gleason 0.752
## 2 gleason pgg45 0.752
2). Using the lm() function to fit univariate regression models predicting patient age from the two highly correlated variables (gleason and pgg45). Which of these variables are statistically significant predictors of age?
# ----------------------------------------------------------------------
# 2. Fit Univariate Regression Models
# ----------------------------------------------------------------------
# Fitting univariate regression models to predict age using `gleason` and `pgg45` as predictors.
# Use lm() to fit univariate regression models predicting patient age
# using two highly correlated variables. Evaluate statistical significance
# of each predictor using summary().
model_gleason <- lm(age ~ gleason, data = prostate)
model_pgg45 <- lm(age ~ pgg45, data = prostate)
# Summarize results for each model
summary(model_gleason) # Check significance of gleason
##
## Call:
## lm(formula = age ~ gleason, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.780 -3.552 1.448 4.220 13.448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.146 6.918 6.525 3.29e-09 ***
## gleason 2.772 1.019 2.721 0.00774 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.209 on 95 degrees of freedom
## Multiple R-squared: 0.0723, Adjusted R-squared: 0.06254
## F-statistic: 7.404 on 1 and 95 DF, p-value: 0.007741
summary(model_pgg45) # Check significance of pgg45
##
## Call:
## lm(formula = age ~ pgg45, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.0889 -3.4533 0.9111 4.4534 15.1822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.08890 0.96758 64.17 < 2e-16 ***
## pgg45 0.07289 0.02603 2.80 0.00619 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.193 on 95 degrees of freedom
## Multiple R-squared: 0.07624, Adjusted R-squared: 0.06651
## F-statistic: 7.84 on 1 and 95 DF, p-value: 0.006189
3). Fitting a multiple linear regression model predicting patient age using both variables
# ----------------------------------------------------------------------
# 3. Fit a Multiple Regression Model
# ----------------------------------------------------------------------
# Fit a multiple regression model predicting patient age using both variables.
# Examine what happens when both correlated variables are included.
# Fit a multiple regression model
model_multivar <- lm(age ~ gleason + pgg45, data = prostate)
# Summarize the multiple regression model
summary(model_multivar)
##
## Call:
## lm(formula = age ~ gleason + pgg45, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.927 -3.677 1.323 4.323 14.420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.95548 9.74316 5.435 4.3e-07 ***
## gleason 1.45363 1.54299 0.942 0.349
## pgg45 0.04490 0.03951 1.137 0.259
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.198 on 94 degrees of freedom
## Multiple R-squared: 0.08488, Adjusted R-squared: 0.06541
## F-statistic: 4.359 on 2 and 94 DF, p-value: 0.01547
Although gleason and pgg45 have
statistically significant univariate effects, this is no longer the case
when both variables are simultaneously included as covariates in a
multivariate regression model.
Including highly correlated variables such as gleason
and pgg45 simultaneously in the same
regression model can lead to problems in fitting a regression model and
interpreting its output.
Although each variable appears to be associated with the response individually, the model cannot distinguish their individual contributions. This can also increase the risk of over-fitting, since the model may fit redundant variables to noise rather than to true relationships.
To allow the information from variables to be included in the same model despite high levels of correlation, we can use dimensionality reduction methods to collapse multiple variables into a single new variable. We can also use modifications to linear regression like regularisation, which we will cover later.
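The effect described above can be reproduced with simulated data. The sketch below (hypothetical data, not the prostate dataset) creates two nearly identical predictors that are each significant on their own but not when fitted jointly.

```r
# Simulated illustration of collinearity (not the prostate data):
# x2 is almost a copy of x1, and y truly depends on x1 only.
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1) # nearly collinear with x1
y <- x1 + rnorm(n)

cor(x1, x2) # very high correlation

# Each predictor is clearly significant on its own...
coef(summary(lm(y ~ x1)))
coef(summary(lm(y ~ x2)))

# ...but fitted jointly, the standard errors inflate and significance can vanish.
coef(summary(lm(y ~ x1 + x2)))
```

The inflation of the standard errors in the joint model is exactly what the variance inflation factor (VIF) quantifies.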
This lecture focuses on statistical methods for visualising and analysing high-dimensional biological data using Bioconductor packages.
Bioconductor is an open-source platform designed for analysing high-throughput genomic data. It provides a variety of useful packages and example datasets. More details and resources are available at: https://www.bioconductor.org/
Bioconductor packages can be installed and managed using the
BiocManager package.
# Let's load the "minfi" package, a Bioconductor package specifically designed
# for analyzing Illumina Infinium DNA methylation arrays.
library("minfi")
browseVignettes("minfi")
Next, we load the methylation dataset, which contains data collected using Illumina Infinium methylation arrays; these arrays are used to examine methylation across the human genome. The data include information collected from the assay as well as associated metadata from the individuals from whom the samples were taken.
library("here")
library("ComplexHeatmap")
methylation <- readRDS(here("data/methylation.rds"))
head(colData(methylation))
methyl_mat <- t(assay(methylation))
## calculate correlations between cells in matrix
cor_mat <- cor(methyl_mat)
cor_mat[1:10, 1:10] # print the top-left corner of the correlation matrix
## cg00075967 cg00374717 cg00864867 cg00945507 cg01027739
## cg00075967 1.00000000 -0.54187597 -0.23545127 -0.03328856 0.15814866
## cg00374717 -0.54187597 1.00000000 0.20194482 -0.01824105 -0.11722829
## cg00864867 -0.23545127 0.20194482 1.00000000 -0.07989013 -0.25388089
## cg00945507 -0.03328856 -0.01824105 -0.07989013 1.00000000 0.32991673
## cg01027739 0.15814866 -0.11722829 -0.25388089 0.32991673 1.00000000
## cg01353448 0.55187311 -0.37130021 0.02410416 0.07201148 0.08380460
## cg01584473 -0.52114045 0.30207176 0.49376101 -0.06213370 -0.04972067
## cg01644850 -0.08130044 0.08972931 0.07465884 0.21406721 0.35845482
## cg01656216 -0.27776967 -0.33015779 0.20665253 -0.14690180 -0.49462321
## cg01873645 0.18174953 -0.10881732 -0.02333667 0.17384071 0.18106477
## cg01353448 cg01584473 cg01644850 cg01656216 cg01873645
## cg00075967 0.551873113 -0.52114045 -0.081300436 -0.277769670 0.18174953
## cg00374717 -0.371300212 0.30207176 0.089729312 -0.330157788 -0.10881732
## cg00864867 0.024104157 0.49376101 0.074658838 0.206652529 -0.02333667
## cg00945507 0.072011477 -0.06213370 0.214067207 -0.146901796 0.17384071
## cg01027739 0.083804601 -0.04972067 0.358454820 -0.494623205 0.18106477
## cg01353448 1.000000000 -0.24710600 -0.001904264 -0.003425993 0.07776795
## cg01584473 -0.247106001 1.00000000 0.013019805 0.263537096 -0.01537876
## cg01644850 -0.001904264 0.01301980 1.000000000 -0.331328760 0.19820041
## cg01656216 -0.003425993 0.26353710 -0.331328760 1.000000000 -0.24664137
## cg01873645 0.077767947 -0.01537876 0.198200410 -0.246641372 1.00000000
The assay() function extracts a matrix-like object in which rows represent probes for genes and columns represent samples. We calculate correlations between features in the methylation dataset and examine the top-left 10 × 10 corner of the resulting matrix. The size of the dataset makes it difficult to examine in full, a common challenge when analysing high-dimensional genomics data.
For the following few exercises, we will be working with human DNA methylation (DNAm) data from flow-sorted blood samples. DNA methylation assays measure, for each of many sites in the genome, the proportion of DNA that carries a methyl mark (a chemical modification that does not alter the DNA sequence). In this case, the methylation data come in the form of a matrix of normalised methylation levels (M-values), where negative values correspond to unmethylated DNA and positive values correspond to methylated DNA. Along with this, we have a number of sample phenotypes (e.g., age in years, BMI).
From our methylation.rds data we can extract:
- the dimensions of the dataset, using dim(). Importantly, in these objects, and in data structures for computational biology in R generally, observations are stored as columns and features (in this case, sites in the genome) are stored as rows. This is in contrast to usual tabular data, where features or variables are stored as columns and observations are stored as rows;
- the assays (normalised methylation levels here), using assay();
- sample-level information, using colData().
library("here")
library("minfi")
methylation <- readRDS(here("data/methylation.rds"))
# Check the dimensions of the dataset using dim().
dim(methylation)
## [1] 5000 37
# The output shows that the object has dimensions of 5000 × 37,
# indicating 5000 features and 37 observations.
# To extract the matrix of methylation M-values, use the assay() function.
methyl_mat <- assay(methylation)
hist(methyl_mat, xlab = "M-value")
# Sample-level information, from head(colData(methylation)):
## DataFrame with 6 rows and 14 columns
## Sample_Well Sample_Name purity Sex Age
## <character> <character> <integer> <character> <integer>
## 201868500150_R01C01 A07 PCA0612 94 M 39
## 201868500150_R03C01 C07 NKpan2510 95 M 49
## 201868500150_R05C01 E07 WB1148 95 M 20
## 201868500150_R07C01 G07 B0044 97 M 49
## 201868500150_R08C01 H07 NKpan1869 95 F 33
## 201868590193_R02C01 B03 NKpan1850 93 F 21
## weight_kg height_m bmi bmi_clas Ethnicity_wide
## <numeric> <numeric> <numeric> <character> <character>
## 201868500150_R01C01 88.4505 1.8542 25.7269 Overweight Mixed
## 201868500150_R03C01 81.1930 1.6764 28.8911 Overweight Indo-European
## 201868500150_R05C01 80.2858 1.7526 26.1381 Overweight Indo-European
## 201868500150_R07C01 82.5538 1.7272 27.6727 Overweight Indo-European
## 201868500150_R08C01 87.5433 1.7272 29.3452 Overweight Indo-European
## 201868590193_R02C01 87.5433 1.6764 31.1507 Obese Mixed
## Ethnic_self smoker Array Slide
## <character> <character> <character> <numeric>
## 201868500150_R01C01 Hispanic No R01C01 2.01869e+11
## 201868500150_R03C01 Caucasian No R03C01 2.01869e+11
## 201868500150_R05C01 Persian No R05C01 2.01869e+11
## 201868500150_R07C01 Caucasian No R07C01 2.01869e+11
## 201868500150_R08C01 Caucasian No R08C01 2.01869e+11
## 201868590193_R02C01 Finnish/Creole No R02C01 2.01869e+11
# Association between age and DNA methylation:
# The following heatmap summarises age and methylation levels available in the methylation dataset:
library("ComplexHeatmap")
age <- methylation$Age
# sort methylation values by age
order <- order(age)
age_ord <- age[order]
methyl_mat_ord <- methyl_mat[, order]
# plot heatmap
Heatmap(methyl_mat_ord,
cluster_columns = FALSE,
show_row_names = FALSE,
show_column_names = FALSE,
name = "M-value",
row_title = "Feature",
column_title = "Sample",
top_annotation = columnAnnotation(age = age_ord))
Challenge 1 Solution:
After attempting to answer these yourself, click on “Show Answers”.
There are several problems with this approach:
1. Multiple testing problem: if we perform 5000 tests for each of 14 variables, even with no true associations, random noise would produce some significant (spurious) results.
2. Small sample size: some covariates may have very small sample sizes (e.g., certain ethnicities), leading to unreliable or spurious findings.
3. Lack of research focus: without a clear research question, interpreting each model becomes ambiguous, and rationalising findings after the fact can lead to unsupported “stories” based solely on significant results.
In the data we have read in, we have a matrix of methylation values \(X\) and a vector of ages, \(y\). One way to model this is to see if we can use age to predict the expected (average) methylation value for sample \(j\) at a given locus \(i\), which can be written as:
\[ E(X_{ij}) = \beta_0 + \beta_1 \text{Age}_j \]
where \(\text{Age}_j\) is the age of sample \(j\). In this model, \(\beta_1\) represents the unit change in mean methylation level for each unit (year) change in age.
For a specific CpG, we can fit this model and extract more
information from the model object. For illustration purposes, we
arbitrarily select the first CpG in the methyl_mat matrix
(the one on its first row).
# Arbitrarily select the first CpG in the methyl_mat matrix (the one on its first row):
age <- methylation$Age
# methyl_mat[1, ] indicates that the 1st CpG will be used as outcome variable
lm_age_methyl1 <- lm(methyl_mat[1, ] ~ age)
lm_age_methyl1
##
## Call:
## lm(formula = methyl_mat[1, ] ~ age)
##
## Coefficients:
## (Intercept) age
## 0.902334 0.008911
We now have estimates for the expected methylation level when age equals 0 (the intercept) and the change in methylation level for a unit change in age (the slope). We could plot this linear model:
plot(age, methyl_mat[1, ], xlab = "Age", ylab = "Methylation level", pch = 16)
abline(lm_age_methyl1)
A scatter plot of age versus methylation level for an arbitrarily selected CpG site (the one stored in the first row of methyl_mat). Each dot represents an individual; the black line represents the estimated linear model. For this feature, we can see that there is no strong relationship between methylation and age. We could try to repeat this for every feature in our dataset; however, we have a lot of features! We need an approach that allows us to assess associations between all of these features and our outcome while addressing the three considerations we outlined previously. Before we introduce this approach, let’s go into detail about how we generally check whether the results of a linear model are statistically significant.
Using the linear model above, we can test whether age affects methylation values for a given CpG via hypothesis testing. Specifically, we test the null hypothesis that \(\beta_1\), the regression coefficient for age, equals zero. If \(\beta_1 = 0\), there is no linear relationship between age and methylation level at the CpG.
The linear model output provides results associated with this null hypothesis by default. It compares observed results (estimated coefficients) to what we might expect under the null hypothesis if the experiment were repeated many times.
# For this linear model, we can use tidy() from the broom package to extract detailed
# information about the coefficients and the associated hypothesis tests in this model:
library("broom")
tidy(lm_age_methyl1)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.902 0.344 2.62 0.0129
## 2 age 0.00891 0.0100 0.888 0.381
For coefficient \(k\) (e.g., slope), the t-statistic is calculated as:
\[ t_k = \frac{\hat{\beta}_k}{SE(\hat{\beta}_k)} \]
Here, \(SE(\hat{\beta}_k)\) quantifies uncertainty in the effect size estimate. Knowing the null distribution of \(t_k\) helps us assess the likelihood of observing these results if the null hypothesis were true.
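This identity can be checked numerically on any fitted model; the sketch below uses simulated data, since the check does not depend on the methylation dataset.

```r
# Verify that the reported t value equals estimate / standard error
# for every coefficient of an lm() fit (simulated data).
set.seed(1)
x <- rnorm(30)
y <- x + rnorm(30)
co <- coef(summary(lm(y ~ x)))
all.equal(unname(co[, "t value"]),
          unname(co[, "Estimate"] / co[, "Std. Error"]))
```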
Challenge 2 Solution:
After attempting to answer these yourself, click on “Show Answers”.
The intercept estimate of 0.902 represents the mean methylation value for the first CpG when age is zero. However, this is not meaningful since there are no observations with age below 20 in the dataset. The p-value tests whether the intercept (\(\beta_0\)) is zero, but this is irrelevant to the analysis as we are more interested in the regression coefficient for age (\(\beta_1\)), which indicates whether there is a linear relationship between age and methylation.
Try fitting a linear model using smoking status as a covariate
instead of age, and create a volcano plot. Note: Smoking status is
stored as methylation$smoker.
In the lecture example, we saw that information sharing can result in larger p-values. Why might this be preferable?
Challenge 3 Solution:
After attempting to answer these yourself, click on “Show Answers”.
1. Fitting linear model using smoking status as a covariate instead of age:
# ----------------------------------------------------------------------
# CHALLENGE 3
# ----------------------------------------------------------------------
# 1.
library("limma") # provides lmFit(), eBayes() and topTable()
design_smoke <- model.matrix(~methylation$smoker)
fit_smoke <- lmFit(methyl_mat, design = design_smoke)
fit_smoke <- eBayes(fit_smoke)
toptab_smoke <- topTable(fit_smoke, coef = 2, number = nrow(fit_smoke))
plot(toptab_smoke$logFC, -log10(toptab_smoke$P.Value),
xlab = "Effect size", ylab = bquote(-log[10](p)),
pch = 19
)
2. Information sharing can result in larger p-values. Why might this be preferable?
Being more conservative when identifying features can help reduce false discoveries. It is also important to be cautious when rejecting the null hypothesis based on small standard errors caused by abnormally low variability for certain features.
With a large number of features, it is important to determine which features are “interesting” or “significant” for further study.
Using a standard significance threshold of 0.05 may lead to many false positives. A p-value threshold of 0.05 means there is a 1 in 20 chance of observing results as extreme or more extreme under the null hypothesis (no association between age and methylation).
If we perform many more than 20 tests, we are likely to observe significant p-values purely due to random chance, even when the null hypothesis is true.
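A quick simulation, independent of the methylation data, makes the same point: when the null hypothesis is true for every test, roughly 5% of p-values still fall below 0.05.

```r
# Simulate 2000 two-sample t-tests in which the null is true
# (both groups drawn from the same N(0, 1) distribution).
set.seed(1)
pvals <- replicate(2000, t.test(rnorm(10), rnorm(10))$p.value)
mean(pvals < 0.05) # close to 0.05 despite there being no true effects
```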
To illustrate this, we can permute (scramble) the age values and rerun the test to see how random chance affects the results.
# ----------------------------------------------------------------------
# Permute (scramble) the age values and rerun the test to see how random chance affects the results.
set.seed(123)
age_perm <- age[sample(ncol(methyl_mat), ncol(methyl_mat))]
design_age_perm <- model.matrix(~age_perm)
fit_age_perm <- lmFit(methyl_mat, design = design_age_perm)
fit_age_perm <- eBayes(fit_age_perm)
toptab_age_perm <- topTable(fit_age_perm, coef = 2, number = nrow(fit_age_perm))
plot(toptab_age_perm$logFC, -log10(toptab_age_perm$P.Value),
xlab = "Effect size", ylab = bquote(-log[10](p)),
pch = 19
)
abline(h = -log10(0.05), lty = "dashed", col = "red")
A random sequence of ages was generated, so no true association between methylation levels and this random sequence is expected. Despite this, many features still show p-values below the traditional significance threshold of p = 0.05. In this example, 226 features are significant at p < 0.05, which demonstrates how using this threshold in a real experiment could falsely identify features as associated with age purely by chance.
When performing multiple tests, features are classified as either “significant” or “non-significant,” but some classifications will inevitably be incorrect. Results fall into four categories:
At a significance level of 5%, 5% of the tests for which the null hypothesis is true will yield false positives (false discoveries) by chance, as p-values are uniformly distributed under the null hypothesis.
To control false discoveries, the Bonferroni correction can be applied:
Divide the significance threshold by the number of tests (n). Alternatively, multiply the p-values by the number of tests. However, Bonferroni is very conservative, especially with a large number of features.
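A small worked example with toy p-values (not taken from the methylation analysis) shows the two equivalent formulations:

```r
# Toy p-values to illustrate Bonferroni: multiply p by n and cap at 1...
p <- c(0.001, 0.01, 0.03, 0.2)
p.adjust(p, method = "bonferroni") # 0.004 0.04 0.12 0.8

# ...or, equivalently, compare the raw p-values to a divided threshold.
0.05 / length(p) # 0.0125
```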
# BONFERRONI correction – which divides the significance level by
# the number of tests performed (n); controls the family-wise error rate (FWER).
# toptab_age holds the topTable() results for the original (non-permuted) age model.
p_raw <- toptab_age$P.Value
p_fwer <- p.adjust(p_raw, method = "bonferroni")
plot(p_raw, p_fwer, pch = 16, log="xy")
abline(0:1, lty = "dashed")
abline(v = 0.05, lty = "dashed", col = "red")
abline(h = 0.05, lty = "dashed", col = "red")
At a significance level of 0.05, with 100 tests, what is the Bonferroni significance threshold?
In a gene expression experiment, after FDR correction with an adjusted p-value threshold of 0.05, 500 significant genes are observed. What proportion of these genes are truly different?
Challenge 4 Solution:
After attempting to answer these yourself, click on “Show Answers”.
1. At a significance level of 0.05, with 100 tests, the Bonferroni significance threshold is 0.05 / 100 = 0.0005.
2. In a gene expression experiment, after FDR correction with an adjusted p-value threshold of 0.05, 500 significant genes are observed. What proportion of these genes are truly different?
# ----------------------------------------------------------------------
# CHALLENGE 4
# ----------------------------------------------------------------------
# 2. We can’t say what proportion of these genes are truly different. However, if we repeated this
# experiment and statistical test over and over, on average 5% of the results from each run
# would be false discoveries.
# Try applying FDR correction to the p_raw vector.
# Hint: Use the p.adjust() function and check help("p.adjust") for details on the method.
# A = The following code runs FDR correction and compares it to non-corrected values and to Bonferroni:
p_fdr <- p.adjust(p_raw, method = "BH")
plot(p_raw, p_fdr, pch = 16, log="xy")
abline(0:1, lty = "dashed")
abline(v = 0.05, lty = "dashed", col = "red")
abline(h = 0.05, lty = "dashed", col = "red")
plot(p_fwer, p_fdr, pch = 16, log="xy")
abline(0:1, lty = "dashed")
abline(v = 0.05, lty = "dashed", col = "red")
abline(h = 0.05, lty = "dashed", col = "red")
Regularisation, also called penalised regression, is used for prediction and feature selection, particularly in high-dimensional data. It stabilises coefficient estimates, enabling models to handle more features than observations.
Standard linear models fail in high-dimensional settings because they cannot be fitted when there are more predictors than data points. While the previous methods shared information across models, regularisation provides a faster and more efficient alternative.
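As a preview of why regularisation helps, ridge regression adds a penalty lambda to the normal equations, beta = (X'X + lambda I)^{-1} X'y, which makes them solvable even when there are more features than observations. A minimal sketch with simulated data:

```r
# Ridge regression via its closed form on simulated data with
# p = 20 features but only n = 10 observations.
set.seed(1)
n <- 10
p <- 20
X <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)
lambda <- 1
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
length(beta_ridge) # all 20 coefficients are estimated
```

In practice this is done with a dedicated package (such as glmnet) rather than the closed form, but the sketch shows why the penalty term restores a unique solution.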
Next, we’ll explore this by applying regularisation to methylation data.
# ----------------------------------------------------------------------
# REGULARISATION
# ----------------------------------------------------------------------
library("here")
library("minfi")
methylation <- readRDS(here("data/methylation.rds"))
## here, we transpose the matrix to have features as rows and samples as columns
methyl_mat <- t(assay(methylation))
age <- methylation$Age
Then, we try to fit a model with outcome age and all 5,000 features in this dataset as predictors (average methylation levels, M-values, across different sites in the genome).
# by using methyl_mat in the formula below, R will run a multivariate regression
# model in which each of the columns in methyl_mat is used as a predictor.
fit <- lm(age ~ methyl_mat)
summary(fit)
You can see that we’re able to get some effect size estimates, but they seem very high! The summary also says that we were unable to estimate effect sizes for 4,964 features because of “singularities”. We clarify what singularities are in the note below, but this means that R couldn’t find a way to perform the calculations necessary to fit the model. Large effect sizes and singularities are common when naively fitting linear regression models with a large number of features (i.e., to high-dimensional data), often because the model cannot distinguish between the effects of many correlated features, or because we have more features than observations.
A singularity occurs when a matrix used in a linear model cannot be inverted, which is necessary for calculating the model’s coefficients. This happens if:
- there are more features (predictors) than observations; or
- the features are highly (or perfectly) correlated.
In such cases, R cannot perform the required calculations and reports that some coefficients could not be estimated because of singularities.
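The same behaviour can be reproduced with simulated data: with more predictors than observations, lm() leaves the surplus coefficients as NA.

```r
# Simulated p > n example: 20 predictors, only 10 observations.
set.seed(1)
n <- 10
p <- 20
X <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)
fit_sim <- lm(y ~ X)
sum(is.na(coef(fit_sim))) # number of coefficients lm() could not estimate
```

Of the 21 coefficients (intercept plus 20 predictors), only as many as there are observations can be estimated; the rest are returned as NA.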
Consider the following questions:
Why would we observe correlated features in high-dimensional biological data?
Why might correlated features be a problem when fitting linear models?
What issue might correlated features present when selecting features to include in a model one at a time?
Challenge 1 Solution:
After attempting to answer these yourself, click on “Show Answers”.
1. Why would we observe correlated features in high-dimensional biological data? Many of the features in biological data represent very similar information biologically. For example, sets of genes that form complexes are often expressed in very similar quantities. Similarly, methylation levels at nearby sites are often very highly correlated.
2. Why might correlated features be a problem when fitting linear models? Correlated features can make inference unstable or even mathematically impossible.
3. What issue might correlated features present when selecting features to include in a model one at a time? When we are selecting features one at a time we want to pick the most predictive feature each time. When a lot of features are very similar but encode slightly different information, which of the correlated features we select to include can have a huge impact on the later stages of model selection!
Challenge 2 Solution:
After attempting to answer these yourself, click on “Show Answers”.
1. What are we minimising when we fit a linear model?
When we fit a linear model we are minimising the squared error!
2. Why are we minimising this objective? What assumptions are we making about our data when we do so?
We minimise the squared error because it treats positive and negative residuals equally, penalises large errors more heavily, and is computationally convenient. Least squares assumes residuals are normally distributed with constant variance after accounting for linear relationships.
Sets of models are often compared using statistics such as adjusted R^2, AIC or BIC. These show us how well the model fits the data used to fit that same model. However, these statistics do not really tell us how well the model will generalise to new data. This is an important thing to consider – if our model doesn’t generalise to new data, then there’s a chance that it’s just picking up on a technical or batch effect in our data, or simply some noise that happens to fit the outcome we’re modelling. This is especially important when our goal is prediction – it’s not much good if we can only predict well for samples where the outcome is already known, after all!
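These in-sample statistics are easy to extract from a fitted model in R. Here is a minimal sketch using the built-in mtcars data (a toy example, not our methylation data):

```r
## Compare two nested linear models using in-sample fit statistics.
## Note: these describe fit to the training data only, not generalisation.
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit1)$adj.r.squared  # adjusted R^2 for each model
summary(fit2)$adj.r.squared
AIC(fit1, fit2)  # lower AIC/BIC indicate a better in-sample trade-off
BIC(fit1, fit2)
```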
To get an idea of how well our model generalises, we can split the data into two - a “training” and a “test” set. We use the “training” data to fit the model, and then see its performance on the “test” data.
One thing that often happens in this context is that large coefficient values minimise the training error, but they don’t minimise the test error on unseen data. First, we’ll go through an example of what exactly this means.
To compare the training and test errors for a model of methylation features and age, we’ll split the data into training and test sets, fit a linear model and calculate the errors. First, let’s calculate the training error. Let’s start by splitting the data into training and test sets:
methylation <- readRDS(here::here("data/methylation.rds"))
library("SummarizedExperiment")
age <- methylation$Age
methyl_mat <- t(assay(methylation))
Subset significant CpGs:
cpg_markers <- c("cg16241714", "cg14424579", "cg22736354", "cg02479575", "cg00864867",
"cg25505610", "cg06493994", "cg04528819", "cg26297688", "cg20692569",
"cg04084157", "cg22920873", "cg10281002", "cg21378206", "cg26005082",
"cg12946225", "cg25771195", "cg26845300", "cg06144905", "cg27377450"
)
horvath_mat <- methyl_mat[, cpg_markers]
## Generate an index to split the data
set.seed(42)
train_ind <- sample(nrow(methyl_mat), 25)
## Split the data
train_mat <- horvath_mat[train_ind, ]
train_age <- age[train_ind]
test_mat <- horvath_mat[-train_ind, ]
test_age <- age[-train_ind]
Now let’s fit a linear model to our training data and calculate the training error. Here we use the mean of the squared difference between our predictions and the observed data, or “mean squared error” (MSE).
## Fit a linear model
# as.data.frame() converts train_mat into a data.frame
# Using the `.` syntax above together with a `data` argument will lead to
# the same result as using `train_age ~ train_mat`: R will fit a multivariate
# regression model in which each of the columns in `train_mat` is used as
# a predictor. We opted to use the `.` syntax because it will help us to
# obtain model predictions using the `predict()` function.
fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))
## Function to calculate the (mean squared) error
mse <- function(true, prediction) {
mean((true - prediction)^2)
}
## Calculate the training error
err_lm_train <- mse(train_age, fitted(fit_horvath))
err_lm_train
## [1] 1.319628
Challenge 3 Solution:
After attempting to answer these yourself, click on “Show Answers”.
1. Calculate the mean squared error for the test set:
First, let’s find the predicted values for the ‘unseen’ test data:
pred_lm <- predict(fit_horvath, newdata = as.data.frame(test_mat))
The mean squared error for the test set is the mean of the squared error between the predicted values and true test data:
err_lm <- mse(test_age, pred_lm)
err_lm
## [1] 223.3571
Unfortunately, the test error is a lot higher than the training error. If we plot true age against predicted age for the samples in the test set, we can gain more insight into the performance of the model on the test set. Ideally, the predicted values should be close to the test data.
Ridge regression adds a penalty to the sum of squared coefficients (L2 norm) to control model complexity. This penalty, scaled by a factor λ, balances reducing large coefficients and fitting the data.
\[ \sum_{i=1}^N (y_i - x'_i \beta)^2 + \lambda \|\beta\|_2^2 \]
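For centred and scaled features, the ridge objective above has a closed-form solution, \(\hat\beta = (X'X + \lambda I)^{-1} X'y\). The following sketch on simulated toy data shows the effect of the penalty (note that glmnet uses a slightly different parameterisation and handles the intercept, so its coefficients will not match these numbers exactly):

```r
## Closed-form ridge solution on simulated data (toy illustration)
set.seed(1)
X <- scale(matrix(rnorm(50 * 5), nrow = 50))  # 50 observations, 5 features
y <- rnorm(50)
lambda <- 2
beta_ols   <- solve(t(X) %*% X, t(X) %*% y)                           # OLS
beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)  # ridge
## The penalty shrinks the overall size (L2 norm) of the coefficients
c(ols = sum(beta_ols^2), ridge = sum(beta_ridge^2))
```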
We’ll use the glmnet package to compare regularized and ordinary
least squares models using a subset of 20 features
(cpg_markers) identified earlier.
library("glmnet")
## glmnet() performs scaling by default, supply un-scaled data:
horvath_mat <- methyl_mat[, cpg_markers] # select the same 20 sites as before
train_mat <- horvath_mat[train_ind, ] # use the same individuals as selected before
test_mat <- horvath_mat[-train_ind, ]
ridge_fit <- glmnet(x = train_mat, y = train_age, alpha = 0)
plot(ridge_fit, xvar = "lambda")
abline(h = 0, lty = "dashed")
Since we split the data into test and training sets, we can check whether ridge regression predicts the test data better than the model with no regularisation. Let’s generate our predictions under the ridge regression model and calculate the mean squared error in the test set:
# Obtain a matrix of predictions from the ridge model,
# where each column corresponds to a different lambda value
pred_ridge <- predict(ridge_fit, newx = test_mat)
# Calculate MSE for every column of the prediction matrix against the vector of true ages
err_ridge <- apply(pred_ridge, 2, function(col) mse(test_age, col))
min_err_ridge <- min(err_ridge)
# Identify the lambda value that results in the lowest MSE (i.e., the "best" lambda value)
which_min_err <- which.min(err_ridge)
pred_min_ridge <- pred_ridge[, which_min_err]
## Return errors
min_err_ridge
## [1] 46.76802
This is much lower than the test error for the model without regularisation:
## [1] 223.3571
We can see where on the continuum of lambdas we’ve picked a model by plotting the coefficient paths again. In this case, we’ve picked a model with fairly modest coefficient shrinking.
chosen_lambda <- ridge_fit$lambda[which.min(err_ridge)]
plot(ridge_fit, xvar = "lambda")
abline(v = log(chosen_lambda), lty = "dashed")
Which performs better, ridge or OLS?
Plot predicted ages for each method against the true ages. How do the predictions look for both methods? Why might ridge be performing better?
Challenge 4 Solution:
After attempting to answer these yourself, click on “Show Answers”.
1. Which performs better, ridge or OLS?
Ridge regression performs significantly better on unseen data, despite being “worse” on the training data.
## [1] 46.76802
## [1] 223.3571
2. Plot predicted ages for each method against the true ages. How do the predictions look for both methods? Why might ridge be performing better?
The ridge ones are much less spread out with far fewer extreme predictions.
all <- c(pred_lm, test_age, pred_min_ridge)
lims <- range(all)
par(mfrow = c(1, 2))
plot(test_age, pred_lm,
xlim = lims, ylim = lims,
pch = 19
)
abline(coef = 0:1, lty = "dashed")
plot(test_age, pred_min_ridge,
xlim = lims, ylim = lims,
pch = 19
)
abline(coef = 0:1, lty = "dashed")
# Two plots showing OLS predictions (left) and ridge regression predictions (right) of age (y) against true age (x).
Why Ridge Performs Better:
Ridge reduces the variance of the model by stabilising coefficient estimates.
It prioritizes generalisability, producing predictions that are more robust to noise.
While OLS attempts to minimize the residual sum of squares without constraints, ridge adds a penalty term, effectively balancing model fit and simplicity.
LASSO regularization uses the L1 norm (sum of absolute coefficient values) to shrink coefficients:
\[ \| \beta \|_1 = \sum_{j=1}^p |\beta_j| \]
This approach encourages sparse models, removing unnecessary features by shrinking some coefficients exactly to zero. The sharp boundaries of the restricted region make it more likely that solutions lie at corners, resulting in simpler models.
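One way to see why the L1 penalty produces exact zeros: in the special case of uncorrelated, standardised features, the LASSO solution is the OLS estimate passed through a “soft-thresholding” function. A minimal sketch:

```r
## Soft-thresholding: the lasso solution in the orthonormal-design case
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}
beta_ols <- c(-3, -0.5, 0.2, 1.5)
soft_threshold(beta_ols, lambda = 1)
## [1] -2.0  0.0  0.0  0.5
## Coefficients smaller than lambda in magnitude are set exactly to zero;
## larger ones are shrunk towards zero by lambda
```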
To balance model complexity and information retention, we choose an optimal λ. A common method is cross-validation, which splits the data into K chunks. K−1 chunks are used for training, and the remaining chunk is for testing. This process rotates through all chunks, providing a reliable estimate of how well each λ value generalizes to new data.
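cv.glmnet() performs this procedure for us, but a hand-rolled sketch of K-fold cross-validation for a plain linear model (on simulated toy data) makes the rotation through the chunks explicit:

```r
## Hand-rolled 5-fold cross-validation on simulated data (toy sketch)
set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- 2 * toy$x + rnorm(100)
K <- 5
folds <- sample(rep(1:K, length.out = nrow(toy)))  # assign rows to folds
cv_err <- sapply(1:K, function(k) {
  train <- toy[folds != k, ]  # fit on K-1 chunks...
  test  <- toy[folds == k, ]  # ...test on the held-out chunk
  fit <- lm(y ~ x, data = train)
  mean((test$y - predict(fit, newdata = test))^2)
})
mean(cv_err)  # cross-validated estimate of the prediction error
```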
We can use this new idea to choose a lambda value by finding the lambda that minimises the error across each of the test and training splits. In R:
# fit lasso model with cross-validation across a range of lambda values
lasso <- cv.glmnet(methyl_mat, age, alpha = 1)
plot(lasso)
# Extract the coefficients from the model with the lowest mean squared error from cross-validation
coefl <- coef(lasso, lasso$lambda.min)
# select only non-zero coefficients
selection <- which(coefl != 0)
# and convert to a normal matrix
selected_coefs <- as.matrix(coefl)[selection, 1]
selected_coefs
## (Intercept) cg02388150 cg06493994 cg22449114 cg22736354
## -8.4133296328 0.6966503392 0.1615535465 6.4255580409 12.0507794749
## cg03330058 cg09809672 cg11299964 cg19761273 cg26162695
## -0.0002362055 -0.7487594618 -2.0399663416 -5.2538055304 -0.4486970332
## cg09502015 cg24771684 cg08446924 cg13215762 cg24549277
## 1.0787003366 4.5743800395 -0.5960137381 0.1481402638 0.6290915767
## cg12304482 cg13131095 cg17962089 cg13842639 cg04080666
## -1.0167896196 2.8860222552 6.3065284096 0.1590465147 2.4889065761
## cg06147194 cg03669936 cg14230040 cg19848924 cg23964682
## -0.6838637838 -0.0352696698 0.1280760909 -0.0006938337 1.3378854603
## cg13578928 cg02745847 cg17410295 cg17459023 cg06223736
## -0.8601170264 2.2346315955 -2.3008028295 0.0370389967 1.6158734083
## cg06717750 cg20138604 cg12851161 cg20972027 cg23878422
## 2.3401693309 0.0084327521 -3.3033355652 0.2442751682 1.1059030593
## cg16612298 cg03762081 cg14428146 cg16908369 cg16271524
## 0.0050053190 -6.5228858163 0.3167227488 0.2302773154 -1.3787104336
## cg22071651 cg04262805 cg24969251 cg11233105 cg03156032
## 0.3480551279 1.1841804186 8.3024629942 0.6130598151 -1.1121959544
We can see that cross-validation has selected a value of λ resulting in 44 features and the intercept.
Elastic net regression combines the properties of ridge (α = 0) and LASSO (α = 1) regression, blending their advantages. It drops uninformative variables like LASSO while maintaining conservative coefficient estimates like ridge, leading to improved predictions.
Elastic net optimizes the following: \[ \sum_{i=1}^N \left( y_i - x'_i\beta \right)^2 + \lambda \left( \alpha \|\beta\|_1 + (1-\alpha) \|\beta\|_2^2 \right) \]
This flexibility allows elastic net to balance variable selection and prediction accuracy. Contour plots help visualize how the penalty changes for different α values.
Fit an elastic net model (hint: alpha = 0.5) without cross-validation and plot the model object.
Fit an elastic net model with cross-validation and plot the error. Compare with LASSO.
Challenge 5 Solution:
After attempting to answer these yourself, click on “Show Answers”.
1. Fit an elastic net model (hint: alpha = 0.5) without cross-validation and plot the model object.
2. Fit an elastic net model with cross-validation and plot the error. Compare with LASSO.
The process of model selection is similar for elastic net models as for LASSO models.
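A sketch of how the challenge code might look, using simulated data so that it runs standalone (for the challenge itself, substitute methyl_mat and age as used earlier in this episode):

```r
## Elastic net (alpha = 0.5) without and with cross-validation,
## on simulated data standing in for the methylation matrix
library("glmnet")
set.seed(42)
X <- matrix(rnorm(100 * 50), nrow = 100)             # 100 samples, 50 features
y <- drop(X[, 1:5] %*% runif(5, 1, 3) + rnorm(100))  # outcome driven by 5 features
## 1. No cross-validation: plot the coefficient paths against lambda
elastic <- glmnet(X, y, alpha = 0.5)
plot(elastic, xvar = "lambda")
## 2. With cross-validation: plot the CV error, as for the LASSO
elastic_cv <- cv.glmnet(X, y, alpha = 0.5)
plot(elastic_cv)
## Number of non-zero terms (including the intercept) at the best lambda
sum(coef(elastic_cv, s = "lambda.min") != 0)
```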
The plot here depicts mean squared error (MSE) against discrete values of lambda, with red points showing the average mean squared error and grey error bars showing the 1 standard error interval around them.
The MSE rises as log lambda increases, indicating a larger prediction error. Two dashed lines indicate the lambda value giving the lowest overall MSE, and the largest lambda whose MSE is within one standard error of the minimum.
K-Means Clustering:
Partitional method that divides data into k predefined clusters based on minimizing variance within clusters. Best for large datasets with clearly defined spherical clusters.
Hierarchical Clustering:
Builds a tree (dendrogram) of nested clusters based on similarity. Useful for small datasets or when exploring hierarchical relationships.
Which to use?
Use K-means for speed and scalability; use hierarchical for interpretability and when the number of clusters is unknown.
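As a quick illustration of the hierarchical approach, here is a minimal sketch using the built-in USArrests data (a toy example; the clustering challenges below use the scRNAseq data):

```r
## Minimal hierarchical clustering sketch on built-in data
d  <- dist(USArrests)                 # pairwise distances between observations
hc <- hclust(d, method = "complete")  # agglomerative (bottom-up) clustering
plot(hc, cex = 0.5)                   # dendrogram of nested clusters
clusters <- cutree(hc, k = 4)         # cut the tree to obtain 4 groups
table(clusters)                       # cluster sizes
```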
Cluster analysis involves finding groups of observations that are more similar to each other (according to some feature) than they are to observations in other groups and are thus likely to represent the same source of heterogeneity. Once groups (or clusters) of observations have been identified using cluster analysis, further analyses or interpretation can be carried out on the groups, for example, using metadata to further explore groups.
Note: Cluster analysis is commonly used to discover unknown groupings in fields such as bioinformatics, genomics, and image processing, in which large datasets that include many features are often produced.
There are various ways to look for clusters of observations in a dataset using different clustering algorithms. One way of clustering data is to minimise distance between observations within a cluster and maximise distance between proposed clusters. Using this process, we can also iteratively update clusters so that we become more confident about the shape and size of the clusters.
K-means clustering groups data points into a user-defined number of distinct, non-overlapping clusters. To create clusters of ‘similar’ data points, K-means clustering creates clusters that minimise the within-cluster variation and thus the amount that data points within a cluster differ from each other. The distance between data points within a cluster is used as a measure of within-cluster variation.
To carry out K-means clustering, we first pick k initial points as centres or “centroids” of our clusters. There are a few ways to choose these initial “centroids”, discussed below. Once we have picked the initial points, we then repeat the following two steps until appropriate clusters have been formed:
1. Assign each data point to the cluster with the closest centroid.
2. Update centroid positions as the average of the points in that cluster.
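These two steps can be sketched by hand on simulated data. This is a bare-bones illustration of the idea only (base R’s kmeans(), used below, implements it properly, with better initialisation, convergence checks, and handling of empty clusters):

```r
## Bare-bones k-means iteration on simulated 2-d data (toy sketch;
## assumes no cluster becomes empty, which kmeans() handles properly)
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))  # two groups of 20 points
k <- 2
centroids <- x[sample(nrow(x), k), ]  # pick k data points as initial centroids
for (iter in 1:10) {
  ## Step 1: assign each point to the cluster with the closest centroid
  d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
  assignment <- apply(d, 1, which.min)
  ## Step 2: move each centroid to the mean of the points assigned to it
  centroids <- apply(x, 2, function(col) tapply(col, assignment, mean))
}
table(assignment)  # cluster sizes after 10 iterations
```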
While K-means has some advantages over other clustering methods (it is easy to implement and to understand), it does have some disadvantages, particularly its sensitivity to the initial placement of the cluster centroids and the need for the user to specify in advance the number of clusters that the data should be partitioned into.
Let’s carry out K-means clustering in R using some real
high-dimensional data. We’re going to work with single-cell RNA
sequencing data, scRNAseq_data, in these clustering
challenges, which is often very high-dimensional. Commonly, experiments
profile the expression level of 10,000+ genes in thousands of cells.
Even after filtering the data to remove low quality observations, the
dataset we’re using in this episode contains measurements for over 9,000
genes in over 3,000 cells.
One way to get a handle on a dataset of this size is to use something we covered earlier in the course - dimensionality reduction. Dimensionality reduction allows us to visualise this incredibly complex data in a small number of dimensions. In this case, we’ll be using principal component analysis (PCA) to compress the data by identifying the major axes of variation in the data, before running our clustering algorithms on this lower-dimensional data to explore the similarity of features.
The scater package has some easy-to-use tools to
calculate a PCA for SummarizedExperiment objects. Let’s
load the scRNAseq_data and calculate some principal
components.
library("SingleCellExperiment")
library("scater")
scrnaseq <- readRDS(here::here("data/scRNAseq_data.rds"))
scrnaseq <- runPCA(scrnaseq, ncomponents = 15)
pcs <- reducedDim(scrnaseq)[, 1:2]
The first two principal components capture almost 50% of the variation within the data. For now, we’ll work with just these two principal components, since we can visualise those easily, and they’re a quantitative representation of the underlying data, representing the two largest axes of variation.
We can now run K-means clustering on the first and second principal
components of the scRNAseq_data using the
kmeans() function.
set.seed(42)
cluster <- kmeans(pcs, centers = 4)
scrnaseq$kmeans <- as.character(cluster$cluster)
plotReducedDim(scrnaseq, "PCA", colour_by = "kmeans")
We can see that this produces a sensible-looking partition of the data. However, is it totally clear whether there might be more or fewer clusters here?
Perform K-means clustering on the first two principal components with k = 5 and plot the result using plotReducedDim(). Save this with a variable name that’s different to what we just used, because we’ll use this again later.
Extension Challenge 1 Solution:
Above is a problem for you to solve; when you’re ready and have had an attempt yourself, click on “Show Answer” to reveal the code.
When performing clustering, it is important for us to be able to measure how well our clusters are separated. One measure to test this is silhouette width, which measures how similar a data point is to points in the same cluster compared to other clusters. The silhouette width is computed for every observation and can range from -1 to 1. A high silhouette width means an observation is closer to other observations within the same cluster. For each cluster, the silhouette widths can then be averaged or an overall average can be taken.
More detail on silhouette widths: In more detail, each observation’s silhouette width is computed as follows:
1. Compute the average distance between the focal observation and all other observations in the same cluster.
2. For each of the other clusters, compute the average distance between the focal observation and all observations in that cluster; keep the smallest of these average distances.
3. Compute the difference (2.) − (1.), then divide by the larger of (1.) and (2.).
Ideally, we would have only large positive silhouette widths, indicating that each data point is much more similar to points within its cluster than it is to the points in any other cluster. However, this is rarely the case. Often, clusters are very fuzzy, and even if we are relatively sure about the existence of discrete groupings in the data, observations on the boundaries can be difficult to confidently place in either cluster.
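The three steps above can be checked by hand for a single observation on simulated toy data, and compared against the cluster package’s implementation:

```r
## Silhouette width for one observation, computed by hand (toy data)
library("cluster")
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 3), ncol = 2))
cl <- kmeans(x, centers = 2)$cluster
dmat <- as.matrix(dist(x))
i <- 1                                              # the focal observation
a <- mean(dmat[i, setdiff(which(cl == cl[i]), i)])  # (1.) within own cluster
b <- mean(dmat[i, which(cl != cl[i])])              # (2.) nearest other cluster
s <- (b - a) / max(a, b)                            # (3.) silhouette width
## Check against the package implementation
sil <- silhouette(cl, dist = dist(x))
all.equal(unname(s), unname(sil[i, "sil_width"]))
## [1] TRUE
```

With k = 2 there is only one “other” cluster, so step (2.) needs no minimum; with more clusters, the smallest of the average distances is kept.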
Here we use the silhouette() function from the
cluster package to calculate the silhouette width of our
K-means clustering using a distance matrix of distances between points
in the clusters.
# Visualizing Silhouette Widths for K-means Clustering
# Import the required package
library("cluster")
# Compute the distance matrix for principal components
dist_mat <- dist(pcs)
# Calculate silhouette widths for K-means clustering
sil <- silhouette(cluster$cluster, dist = dist_mat)
# Plot the silhouette widths
plot(sil,
main = "Silhouette Widths for K-means Clustering", # Improved and clear title
border = NA, # Remove borders for better aesthetics
col = rainbow(length(unique(cluster$cluster)))
) # Color each cluster uniquely
Let’s plot the silhouette score on the original dimensions used to cluster the data. Here, we’re mapping cluster membership to point shape, and silhouette width to colour.
pc <- as.data.frame(pcs)
colnames(pc) <- c("x", "y")
pc$sil <- sil[, "sil_width"]
pc$clust <- factor(cluster$cluster)
mean(sil[, "sil_width"])
## [1] 0.7132479
library(ggplot2)
ggplot(pc) +
aes(x = x, y = y, shape = as.factor(clust), colour = sil) +
geom_point(size = 2, alpha = 0.9) + # Adjust size and transparency for better visibility
scale_colour_gradient2(
low = "skyblue", mid = "#8494FF", high = "tomato", midpoint = 0.4,
name = "Silhouette\nScore" # Add a descriptive legend title
) +
scale_shape_manual(
values = setNames(16:19, 1:4), # Use filled shapes for better visibility
name = "Cluster" # Add a legend title
) +
labs(
title = "Cluster Visualization with Silhouette Scores",
subtitle = "Scatter plot coloured according to silhouette width, grouped according to cluster membership",
x = "Principal Component 1",
y = "Principal Component 2"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
legend.title = element_text(size = 12, face = "bold"),
legend.text = element_text(size = 10),
axis.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 10)
)
This plot shows that silhouette values for individual observations tend to be very high in the centre of clusters, but become quite low towards the edges. This makes sense, as points that lie “between” two clusters may be more similar to points in another cluster than they are to the points in the cluster they belong to.
Calculate the silhouette width for the k of 5 clustering we did earlier. Are 5 clusters appropriate? Why/why not?
Can you identify where the differences lie?
1. Calculating the silhouette width: if the average silhouette width is close to or above 0.5, the clustering is generally appropriate; if below 0.5, it suggests poor cluster structure.
Here, the average silhouette width is about 0.59, so the clustering is generally appropriate.
sil5 <- silhouette(cluster5$cluster, dist = dist_mat)
scrnaseq$kmeans5 <- as.character(cluster5$cluster)
plotReducedDim(scrnaseq, "PCA", colour_by = "kmeans5")
# Scatter plot of principal component 2 against principal component 1 in the scRNAseq data, coloured by clusters produced by k-means clustering.
mean(sil5[, "sil_width"])
## [1] 0.5914255
2. The average silhouette width is lower when k=5.
This seems to be because some observations in clusters 3 and 5 seem to be more similar to other clusters than the one they have been assigned to. This may indicate that k is too high.
Gap statistic:
Another measure of clustering quality is the “gap statistic”. This compares the observed squared distances between observations in a cluster and the centre of the cluster to the “expected” squared distances. The expected distances are calculated by randomly distributing cells within the range of the original data. Larger values of the gap statistic indicate that the within-cluster distances are smaller than expected by chance, and thus better clustering. We can see how this is calculated in the following example.
library("cluster")
gaps <- clusGap(pcs, kmeans, K.max = 20, iter.max = 20)
best_k <- maxSE(gaps$Tab[, "gap"], gaps$Tab[, "SE.sim"])
best_k
## [1] 5
plot(gaps$Tab[,"gap"], xlab = "Number of clusters", ylab = "Gap statistic")
abline(v = best_k, col = "red")
When we cluster data, we want to be sure that the clusters we identify are not a result of the exact properties of the input data. That is, we want to ensure that the clusters identified do not change substantially if the observed data change slightly. This makes it more likely that the clusters can be reproduced.
To assess this, we can bootstrap. We first resample the data with replacement to produce a ‘new’ data set. We can then calculate new clusters for this data set and compare these to those on the original data set, thus helping us to see how the clusters may change for small changes in the data. This is maybe easier to see with an example. First, we define some data:
data <- 1:5
Then, we can take a sample from this data without replacement:
sample(data)
## [1] 1 2 3 5 4
This sample is a permutation of the original data, and each point is present exactly once. This is the case every time, even if we repeat the sampling many times:
replicate(10, sample(data))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 5 4 2 2 2 5 4 1 2 5
## [2,] 3 5 1 1 3 1 5 4 3 1
## [3,] 4 3 4 3 5 2 3 3 4 3
## [4,] 2 2 3 4 4 4 2 2 5 4
## [5,] 1 1 5 5 1 3 1 5 1 2
However, if we sample with replacement, then sometimes individual data points are present more than once:
replicate(10, sample(data, replace = TRUE))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 5 3 1 5 1 1 4 4 5 4
## [2,] 1 3 5 1 1 3 5 1 4 2
## [3,] 5 3 2 5 3 1 2 4 3 2
## [4,] 2 3 5 2 2 4 2 3 1 4
## [5,] 3 2 2 1 3 5 5 1 1 4
Bootstrapping is a powerful and common statistical technique.
We would like to know about the sampling distribution of a statistic, but we don’t have any knowledge of its behaviour under the null hypothesis.
For example, we might want to understand the uncertainty around an estimate of the mean of our data. To do this, we could resample the data with replacement many times and calculate the mean of each resample.
boots <- replicate(1000, mean(sample(data, 5, replace = TRUE)))
hist(boots,
breaks = "FD",
main = "1,000 bootstrap samples",
xlab = "Mean of sample"
)
In this case, the example is simple, but it’s possible to devise more complex statistical tests using this kind of approach.
The bootstrap, along with permutation testing, can be a very flexible and general solution to many statistical problems.
In applying the bootstrap to clustering, we want to see two things:
1. Will observations within a cluster consistently cluster together in different bootstrap replicates?
2. Will observations frequently swap between clusters?
In the plot below, the diagonal of the plot shows how often the clusters are reproduced in bootstrap replicates. High scores on the diagonal mean that the clusters are consistently reproduced in each bootstrap replicate. Similarly, the off-diagonal elements represent how often observations swap between clusters in bootstrap replicates. High scores indicate that observations rarely swap between clusters.
library("pheatmap")
library("bluster")
library("viridis")
km_fun <- function(x) {
kmeans(x, centers = 4)$cluster
}
set.seed(42)
ratios <- bootstrapStability(pcs, FUN = km_fun, clusters = cluster$cluster)
pheatmap(ratios,
cluster_rows = FALSE, cluster_cols = FALSE,
col = viridis(10),
breaks = seq(0, 1, length.out = 10)
)
Yellow boxes indicate values slightly greater than 1, which can occur; these are still “good”, even though they fall just outside the range shown on the colour bar.
How do the bootstrap stability results change for the k = 5 clustering we performed earlier? Can you relate what you see to the clusters plotted with plotReducedDim()?
Extension Challenge 3 Solution:
Above is a problem for you to solve; when you’re ready and have had an attempt yourself, click on “Show Answer” to reveal the code.
# Plot grid of empirical cluster swapping behaviour estimated by the bootstrap samples.
km_fun5 <- function(x) {
kmeans(x, centers = 5)$cluster
}
set.seed(42)
ratios5 <- bootstrapStability(pcs, FUN = km_fun5, clusters = cluster5$cluster)
pheatmap(ratios5,
cluster_rows = FALSE, cluster_cols = FALSE,
col = viridis(10),
breaks = seq(0, 1, length.out = 10)
)
When k = 5, we can see that the values on the diagonal of the matrix are smaller, indicating that the clusters aren’t exactly reproducible in the bootstrap samples.
Similarly, some of the off-diagonal elements are considerably lower, indicating that observations “swap” between these clusters in bootstrap replicates.